Implement: Split-Apply-Combine with LINQ #16

mratsim · 2017-03-01T08:36:52Z

First of all: This is awesome! I'm willing to test the hell out of this library.

This is related to: #13

The split-apply-combine pattern groups the data, computes aggregate functions on each group (sum, count, or max, for example), and then applies transformations to the original dataframe based on those aggregates. It is extremely useful for data science.
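To make the three steps concrete, here is a minimal sketch in plain Python (standard library only, no dataframe library). The toy rows and the helper name `add_group_count` are made up for illustration; only the column names `Ticket` and `CptTicketFreq` are borrowed from the transformers further down.

```python
from collections import defaultdict

# Toy rows standing in for a dataframe; the values are invented for illustration.
rows = [
    {"PassengerId": 1, "Ticket": "A1"},
    {"PassengerId": 2, "Ticket": "B7"},
    {"PassengerId": 3, "Ticket": "A1"},
    {"PassengerId": 4, "Ticket": "A1"},
]


def add_group_count(rows, key, new_column):
    """Return copies of the rows with the size of each row's group attached."""
    # Split: bucket the rows by the grouping key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Apply: compute one aggregate (here a count) per group.
    counts = {k: len(members) for k, members in groups.items()}
    # Combine: broadcast the aggregate back onto every original row.
    return [{**row, new_column: counts[row[key]]} for row in rows]


result = add_group_count(rows, "Ticket", "CptTicketFreq")
```

The output keeps one row per input row, in the original order, which is what distinguishes this "transform" flavor from a plain group-and-aggregate that returns one row per group.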

Regarding the syntax: Python's pandas and R's dplyr each use a custom syntax. Julia's DataFramesMeta uses LINQ.

I believe the LINQ approach is much more powerful in terms of extensibility and could have a very nice pipe syntax in Nim. It could also extend well to multiple backends (SQL, Hadoop, Feather, ...).
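As a rough illustration of what a piped, LINQ-style chain could look like, here is a small Python sketch. It is not Nim and not an existing library's API: `Step`, `where`, `count_by`, and `select` are hypothetical names invented for this example, and the pipe is emulated with Python's `|` operator.

```python
from collections import Counter


class Step:
    """Wraps a function over a list of rows so steps can be chained with |."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, rows):
        # Invoked for `rows | step` because a plain list does not define `|`.
        return self.fn(rows)


def where(pred):
    # Keep only the rows for which the predicate holds.
    return Step(lambda rows: [r for r in rows if pred(r)])


def count_by(key, new_column):
    # Attach to every row the number of rows sharing its key value.
    def run(rows):
        counts = Counter(r[key] for r in rows)
        return [{**r, new_column: counts[r[key]]} for r in rows]
    return Step(run)


def select(*cols):
    # Project each row onto the requested columns.
    return Step(lambda rows: [{c: r[c] for c in cols} for r in rows])


# Invented sample data.
passengers = [
    {"PassengerId": 1, "Ticket": "A1", "Age": 30},
    {"PassengerId": 2, "Ticket": "B7", "Age": 4},
    {"PassengerId": 3, "Ticket": "A1", "Age": 41},
]

adults = (
    passengers
    | where(lambda r: r["Age"] >= 18)
    | count_by("Ticket", "CptTicketFreq")
    | select("PassengerId", "CptTicketFreq")
)
```

Each step here runs eagerly on an in-memory list. A design aimed at several backends would more likely have the steps build up a query description that each backend interprets in its own way, but that is beyond this sketch.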

Here are two sample scikit-learn transformers that compute the ticket frequency on the Kaggle Titanic dataset, to showcase LINQ split-apply-combine versus pandas:

Python

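import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PP_TicketFreqTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.df_Ticket = pd.DataFrame()

    def fit(self, X, y=None, **fit_params):
        # Remember the data seen at fit time so its tickets count at transform time
        self.df_Ticket = pd.DataFrame(X)
        return self

    def transform(self, X):
        dfX = pd.DataFrame(X)
        df = dfX
        # When transforming new data, count tickets over fit data and new data together
        if not df.equals(self.df_Ticket):
            df = pd.concat([self.df_Ticket, df]).reset_index()
        # Split by Ticket, apply count, combine back as a new column
        return dfX.assign(
            CptTicketFreq = df.groupby('Ticket')['Ticket']
                            .transform('count')
        )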

Julia -- Piping is done with |>

type PP_TicketFreqTransformer <: ScikitLearnBase.BaseEstimator
    df_TicketFreq::DataFrame
    PP_TicketFreqTransformer() = new()
end

@declare_hyperparameters(PP_TicketFreqTransformer, Symbol[])

function ScikitLearnBase.fit!(self::PP_TicketFreqTransformer, X::DataFrame, y=nothing)
    self.df_TicketFreq = X
    return self
end

function ScikitLearnBase.transform(self::PP_TicketFreqTransformer, X::DataFrame)
    df = ifelse(isequal(X,self.df_TicketFreq), X, vcat(X,self.df_TicketFreq))
    @linq  df |>  by(:PassengerId,CptTicketFreq = length(:Ticket)) |>
        join(X, on=:PassengerId)
end

The full code is available in Python and in Julia.

Here is a link to a LINQ idea on Nim's forum as well
