Implement: Split-Apply-Combine with LINQ #16

Open
mratsim opened this issue Mar 1, 2017 · 0 comments
mratsim commented Mar 1, 2017

First of all: this is awesome! I'm willing to test the hell out of this library.

This is related to: #13

The split-apply-combine pattern is extremely useful for data science: it groups the data, computes aggregate functions on each group (sum, count, or max, for example), and applies transformations to the original dataframe based on those aggregates.
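To make the three steps concrete, here is a minimal sketch in plain Python, with no dataframe library. The function name `ticket_frequency` and the sample rows are made up for illustration; only the `Ticket`, `PassengerId` and `CptTicketFreq` column names come from the examples in this issue.

```python
from collections import defaultdict

def ticket_frequency(rows):
    """Return copies of rows with a CptTicketFreq field: how many rows share that Ticket."""
    # Split: bucket the rows by their grouping key.
    groups = defaultdict(list)
    for row in rows:
        groups[row["Ticket"]].append(row)
    # Apply: compute one aggregate (here, a count) per group.
    counts = {ticket: len(members) for ticket, members in groups.items()}
    # Combine: broadcast each group's aggregate back onto the original rows.
    return [{**row, "CptTicketFreq": counts[row["Ticket"]]} for row in rows]

# Hypothetical sample data, not taken from the real Titanic dataset.
passengers = [
    {"PassengerId": 1, "Ticket": "A/5 21171"},
    {"PassengerId": 2, "Ticket": "PC 17599"},
    {"PassengerId": 3, "Ticket": "A/5 21171"},
]
result = ticket_frequency(passengers)
```

The result keeps one row per passenger, in the original order, which is what distinguishes this pattern from a plain group-and-aggregate that returns one row per group.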

Regarding the syntax: Python's pandas and R's dplyr each use a custom syntax, while Julia's DataFramesMeta uses LINQ.

I believe the LINQ approach is much more powerful in terms of extensibility and could have a very nice pipe syntax in Nim. It could also extend well to multiple backends (SQL, Hadoop, Feather, ...).
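As a rough illustration of what a pipe-style query could feel like, here is a small sketch. It is written in Python rather than Nim, and every name in it (`Step`, `group_count`, `join`) is hypothetical, not an existing API of this library or of any LINQ implementation. Each query operator is an ordinary value, so a different backend could in principle interpret the same chain differently.

```python
from collections import Counter

class Step:
    """Wraps a function so it can be applied with the | operator (hypothetical helper)."""
    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, data):
        # `data | step` ends up here because lists do not define `|` themselves.
        return self.fn(data)

def group_count(key, into):
    """Group rows by `key` and emit one row per group holding its size under `into`."""
    def run(rows):
        counts = Counter(row[key] for row in rows)
        return [{key: value, into: n} for value, n in counts.items()]
    return Step(run)

def join(other, on):
    """Inner-join the piped rows onto `other`, keeping the row order of `other`."""
    def run(rows):
        lookup = {row[on]: row for row in rows}
        return [{**o, **lookup[o[on]]} for o in other if o[on] in lookup]
    return Step(run)

# Hypothetical sample data.
passengers = [
    {"PassengerId": 1, "Ticket": "A/5 21171"},
    {"PassengerId": 2, "Ticket": "PC 17599"},
    {"PassengerId": 3, "Ticket": "A/5 21171"},
]

# Split and apply (group_count), then combine (join back onto the original rows).
with_freq = passengers | group_count("Ticket", into="CptTicketFreq") | join(passengers, on="Ticket")
```

Here the steps run eagerly on in-memory lists. A real design would more likely build a query description first and hand it to a backend.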

Here are two sample scikit-learn transformers that compute the ticket frequency on the Kaggle Titanic dataset, to showcase LINQ split-apply-combine versus pandas.

Python

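import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PP_TicketFreqTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.df_Ticket = pd.DataFrame()

    def fit(self, X, y=None, **fit_params):
        # Remember the training data so its tickets are counted at transform time
        self.df_Ticket = pd.DataFrame(X)
        return self

    def transform(self, X):
        dfX = pd.DataFrame(X)
        df = dfX
        if not df.equals(self.df_Ticket):
            # New data: count tickets over the training rows followed by the new rows
            df = pd.concat([self.df_Ticket, df]).reset_index()
        # Split by Ticket, count each group, broadcast the count back to every row
        freq = df.groupby('Ticket')['Ticket'].transform('count')
        # The rows of X are the last len(dfX) rows of df, so align them by position
        # (aligning on the reset index would pick up the training rows instead)
        return dfX.assign(CptTicketFreq=freq.to_numpy()[len(df) - len(dfX):])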

Julia -- Piping is done with |>

type PP_TicketFreqTransformer <: ScikitLearnBase.BaseEstimator
    df_TicketFreq::DataFrame
    PP_TicketFreqTransformer() = new()
end

@declare_hyperparameters(PP_TicketFreqTransformer, Symbol[])

function ScikitLearnBase.fit!(self::PP_TicketFreqTransformer, X::DataFrame, y=nothing)
    self.df_TicketFreq = X
    return self
end

function ScikitLearnBase.transform(self::PP_TicketFreqTransformer, X::DataFrame)
    df = ifelse(isequal(X,self.df_TicketFreq), X, vcat(X,self.df_TicketFreq))
    @linq  df |>  by(:PassengerId,CptTicketFreq = length(:Ticket)) |>
        join(X, on=:PassengerId)
end

Full code is available in Python and Julia.

Here is a link to a LINQ idea on Nim's forum as well
