Wednesday, December 20, 2017

Improving Data Quality with Machine Learning Techniques.



https://data.bloomberglp.com/promo/sites/12/750171296-FinDataGovernance.pdf

============================================================
https://www.bis.org/ifc/events/ifc_nbb_workshop/ifc_nbb_workshop_2d3.pdf

Improving Data Quality and Closing Data Gaps
with Machine Learning
Tobias Cagala*
May 5, 2017

=======================================================================


http://conteudo.icmc.usp.br/pessoas/gbatista/files/aai2003.pdf


An Analysis of Four Missing Data
Treatment Methods for Supervised Learning
Gustavo E. A. P. A. Batista and Maria Carolina Monard
University of São Paulo - USP
Institute of Mathematics and Computer Science - ICMC
Department of Computer Science and Statistics - SCE
Laboratory of Computational Intelligence - LABIC
P. O. Box 668, 13560-970 - São Carlos, SP, Brazil
{gbatista, mcmonard}@icmc.usp.br

==================================================================


http://dimacs-algorithmic-mdm.wdfiles.com/local--files/start/Methodologies%20for%20Data%20Quality%20Assessment%20and%20Improvement.pdf

Methodologies for Data Quality Assessment and Improvement
CARLO BATINI
Università di Milano - Bicocca
CINZIA CAPPIELLO
Politecnico di Milano
CHIARA FRANCALANCI
Politecnico di Milano
and
ANDREA MAURINO
Università di Milano - Bicocca


====================================================================

https://www.waterstechnology.com/reference-data-data-management/data-governance/3309956/webinar-making-enterprise-data-quality-a-reality

Webinar: Making Enterprise Data Quality a Reality


======================================================================

https://www.waterstechnology.com/innovationhub


=========================================================

https://www.bloomberg.com/professional/blog/machine-learning-plays-critical-role-improving-data-quality/

=============================================================


Monday, December 11, 2017

DATA QUALITY STRATEGY

DATA QUALITY STRATEGY: A STEP-BY-STEP APPROACH

“Strategy is a cluster of decisions centered on goals that determine what actions to take and how
to apply resources.”

Certainly a cluster of decisions – in this case concerning six specific factors
– will need to be made to effectively improve the data. Corporate goals will determine
how the data is used and the level of quality needed. Actions are the processes improved and invoked
to manage the data. Resources are the people, systems, financing, and the data itself.
We can therefore apply the selected definition in the context of data, and arrive at the definition
of data quality strategy:

“A cluster of decisions centered on organizational data quality goals that determine the data
processes to improve, solutions to implement, and people to engage.”

EXECUTIVE SUMMARY
This paper will discuss:
• Goals that drive a data quality strategy
• Six factors that should be considered when building a strategy
• Decisions within each factor
• Actions stemming from those decisions
• Resources affected by the decisions and needed to support the actions.

You will see how these six factors — when added together in different combinations — provide the
answer as to how people, process and technology are the integral and fundamental elements of
information quality.

GOALS OF DATA QUALITY
Goals drive strategy. Data quality goals must support on-going functional operations, data management
processes, or other initiatives such as the implementation of a new data warehouse (DW), CRM
application, or loan processing system.

THE SIX FACTORS OF DATA QUALITY

When creating a data quality strategy there are six factors, or aspects of an organization’s operations that
must be considered. Those six factors include:
• Context — the type of data being cleansed and the purposes for which it is used
• Storage — where the data resides
• Data Flow — how the data enters and moves through the organization
• Work Flow — how work activities interact with and use the data
• Stewardship — people responsible for managing the data
• Continuous Monitoring — processes for regularly validating the data

Figure 1 depicts the six factors centered on the goals of a data quality initiative, and shows that each factor requires decisions to be made, actions that need to be carried out, and resources to be allocated.

TYING IT ALL TOGETHER

In order for any strategy framework to be useful and effective, it must be scalable. The strategy framework provided here is scalable from a simple one-field update, such as validating gender codes of male and female, to an enterprise-wide initiative where 97 ERP systems need to be cleansed and consolidated into 1 system. To ensure the success of the strategy, and hence the project, each of the six factors must be evaluated. The size (number of records/rows) and scope (number of databases, tables, and columns) determine the depth to which each factor is evaluated.

Taken all together or in smaller groups, the six factors act as operands in data quality strategy formulas.

• Context by itself = The type of cleansing algorithms needed
• Context + Storage + Data Flow + Work Flow = The types of cleansing and monitoring technology implementations needed
• Stewardship + Work Flow = Near-term personnel impacts
• Stewardship + Work Flow + Continuous Monitoring = Long-term personnel impacts
• Data Flow + Work Flow + Continuous Monitoring = Changes to processes

It is as a result of using these formulas that people come to understand that information quality truly is the integration of people, process, and technology in the pursuit of deriving value from information assets.



To help the practitioner employ the data quality strategy methodology,
the core practices have been extracted from the factors and listed here.
a) A statement of the goals driving the project
b) A list of data sets and elements that support the goal
c) A list of data types and categories to be cleansed(1)
d) A catalog, schema or map of where the data resides(2)
e) A discussion of cleansing solutions per category of data(3)
f) Dataflow diagrams of applicable existing dataflows
g) Work flow diagrams of applicable existing work flows
h) A plan for when and where the data is accessed for cleansing(4)
i) A discussion of how the dataflow will change after project implementation
j) A discussion of how the work flow will change after project implementation
k) A list of stakeholders affected by the project
l) A plan for educating stakeholders as to the benefits of the project
m) A plan for training operators and users
n) A list of data quality measurements and metrics to monitor
o) A plan for when and where to monitor(5)
p) A plan for initial and then regularly scheduled cleansing


Credit goes to Frank Dravis
https://pdfs.semanticscholar.org/5b5b/a15e8ea1bd89fe4d14d5e97ce456436291e0.pdf





Sunday, November 12, 2017

Functional Programming

https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-001-structure-and-interpretation-of-computer-programs-spring-2005/video-lectures/


https://www.youtube.com/watch?v=7Zlp9rKHGD4


https://pragprog.com/magazines/2013-01/functional-programming-basics

https://maryrosecook.com/blog/post/a-practical-introduction-to-functional-programming



===============================================================

Functional programming is based on the lambda calculus.
It has no assignment statements.

A statement has side effects, whereas an expression has no side effects.

It is an expression-based programming paradigm: using expressions as opposed to statements, combining expressions to form functions, and combining functions to form complex behaviors.
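A minimal Scala sketch of the contrast (illustrative only; the function names are made up):

// Statement style: mutates state (a side effect); the result depends on execution order.
var total = 0
def addToTotal(x: Int): Unit = { total += x }

// Expression style: only computes a value; same inputs always give the same output.
def add(a: Int, b: Int): Int = a + b

// Combining expressions into functions, and functions into more complex behaviour.
def square(x: Int): Int = x * x
def sumOfSquares(xs: List[Int]): Int = xs.map(square).sum
// sumOfSquares(List(1, 2, 3)) == 14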

=================================================================

Functional Programming Jargon


Pure Functions


Immutability


Recursion


No Side Effects


Higher Order Functions


Category Theory


Lambda Calculus


Currying


Type Strictness : Int is a type without a class, whereas String is a type with a class.
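A short Scala sketch illustrating a few of these terms (pure functions, higher-order functions, currying, immutability and recursion); the function names are made up for illustration:

// Pure function: output depends only on its inputs, no side effects.
def double(x: Int): Int = x * 2

// Higher-order function: takes another function as an argument.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))   // applyTwice(double, 3) == 12

// Currying: a two-argument function rewritten as a chain of one-argument functions.
def addCurried(a: Int)(b: Int): Int = a + b
val addFive: Int => Int = addCurried(5)                // partially applied; addFive(10) == 15

// Immutability + recursion instead of loops and mutation.
def length[A](xs: List[A]): Int = xs match {
  case Nil       => 0
  case _ :: tail => 1 + length(tail)
}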

Saturday, October 28, 2017

Best Content on Spark

https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf

https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf

Following is a summary of the above 2 PDF documents.

Spark Major core components: 

– Execution Model
– The Shuffle
– Caching

Spark Execution Model 

1. Create DAG of RDDs to represent computation
2. Create logical execution plan for DAG
   • Pipeline as much as possible
   • Split into “stages” based on need to reorganize data
3. Schedule and execute individual tasks
   • Split each stage into tasks
   • A task is data + computation
   • Execute all tasks within a stage before moving on
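A small Scala sketch of where these stages come from, assuming an existing SparkContext sc; the input path is a placeholder:

// Stage 1: textFile -> flatMap -> map are pipelined (narrow dependencies),
// so each task reads a block and applies all three operations to it.
val counts = sc.textFile("hdfs:///tmp/input.txt")      // placeholder path
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  // reduceByKey needs the data reorganized by key, so a new stage starts here.
  .reduceByKey(_ + _)

// The action triggers the job: it is split into stages, and each stage into
// one task per partition, executed before the next stage begins.
counts.collect()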
---------------------------------------------







“The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”


“We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.

In both cases, keeping data in memory can improve performance by an order of magnitude.”


An RDD can be created in 2 ways:

1) Parallelize a collection
2) Read data from an external source (S3, C*, HDFS, etc.)
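For example, in Scala (assuming an existing SparkContext sc; the path is a placeholder):

// 1) Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(1 to 1000, numSlices = 8)

// 2) Read data from an external source (here a text file; S3, Cassandra, HDFS, etc. work similarly).
val lines = sc.textFile("hdfs:///tmp/access.log")      // placeholder path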

Life Cycle of a Spark Program

1) Create some input RDDs from external data or parallelize a collection in your driver program.

2) Lazily transform them to define new RDDs using transformations like filter() or map()

3) Ask Spark to cache() any intermediate RDDs that will need to be reused.

4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
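A sketch of this life cycle in Scala, again assuming an existing SparkContext sc and a placeholder path:

// 1) Create an input RDD from external data.
val lines  = sc.textFile("hdfs:///tmp/access.log")           // placeholder path

// 2) Lazily define new RDDs with transformations; nothing executes yet.
val errors = lines.filter(line => line.contains("ERROR"))
val pairs  = errors.map(line => (line.split(" ")(0), 1))

// 3) Cache an intermediate RDD that will be reused.
errors.cache()

// 4) Actions kick off the parallel computation.
val totalErrors  = errors.count()
val firstHundred = errors.take(100)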


Transformations (Lazy)

map()
intersection()
cartesian()
flatMap()
distinct()
pipe()
filter()
groupByKey()
coalesce()
mapPartitions()
reduceByKey()
repartition()
mapPartitionsWithIndex()
sortByKey()
partitionBy()
sample()
join() ...
union()
cogroup() ...

Transformations are lazy. Most transformations are element-wise (they work on one element at a time), but this is not true for all transformations.

Actions

reduce()
takeOrdered()
collect()
saveAsTextFile()
count()
saveAsSequenceFile()
first()
saveAsObjectFile()
take()
countByKey()
takeSample()
foreach()
saveToCassandra()

Types of RDDs

HadoopRDD • FilteredRDD • MappedRDD • PairRDD • ShuffledRDD • UnionRDD • PythonRDD • DoubleRDD • JdbcRDD • JsonRDD • SchemaRDD • VertexRDD • EdgeRDD


RDD Interface

1) Set of partitions (“splits”)
2) List of dependencies on parent RDDs
3) Function to compute a partition given parents
4) Optional preferred locations
5) Optional partitioning info for k/v RDDs (Partitioner)

Example: HADOOPRDD

Partitions = one per HDFS block
Dependencies = none
Compute (partition) = read corresponding block
preferredLocations (part) = HDFS block location
Partitioner = none

Example : FilteredRDD

Partitions = same as parent RDD
Dependencies = “one-to-one” on parent
Compute (partition) = compute parent and filter it
preferredLocations (part) = none (ask parent)
Partitioner = none


Example : JoinedRDD

Partitions = One per reduce task
Dependencies = “shuffle” on each parent
Compute (partition) = read and join shuffled data
preferredLocations (part) = none
Partitioner = HashPartitioner(numTasks)
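As a sketch of how this interface is implemented, the following toy RDD generates the integers [0, n) across a number of partitions. It uses Spark's internal developer API (RDD, Partition, TaskContext, as in Spark 1.x/2.x); the names RangeRDD and RangePartition are made up for illustration:

import org.apache.spark.{Partition, Partitioner, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: each split covers a sub-range of [0, n).
class RangePartition(override val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD producing the integers [0, n) split across numSlices partitions.
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  // 1) Set of partitions ("splits")
  override protected def getPartitions: Array[Partition] = {
    val step = math.ceil(n.toDouble / numSlices).toInt
    (0 until numSlices)
      .map(i => new RangePartition(i, i * step, math.min((i + 1) * step, n)))
      .toArray
  }

  // 2) Dependencies on parent RDDs: none (the Nil passed to the constructor above)

  // 3) Function to compute a partition given its parents
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // 4) Optional preferred locations: none for generated data
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // 5) Optional partitioner: none (this is not a key/value RDD)
  override val partitioner: Option[Partitioner] = None
}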


Memory and Persistence


- Recommended to use at most only 75% of a machine’s memory for Spark
- Minimum Executor heap size should be 8 GB
- Max Executor heap size depends… maybe 40 GB (watch GC)
- Memory usage is greatly affected by storage level and serialization format

- If RDD fits in memory, choose MEMORY_ONLY
- If not, use MEMORY_ONLY_SER w/ fast serialization library
- Don’t spill to disk unless functions that computed the datasets are very expensive or they filter a large amount of data. (recomputing may be as fast as reading from disk)
- Use replicated storage levels sparingly and only if you want fast fault recovery (maybe to serve requests from a web app)
- Intermediate data is automatically persisted during shuffle operations.
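A Scala sketch of choosing storage levels along these lines (assuming an existing SparkContext sc; paths are placeholders):

import org.apache.spark.storage.StorageLevel

// Fits in memory: plain deserialized caching (same as persist(MEMORY_ONLY)).
val hot = sc.textFile("hdfs:///tmp/lookup.csv").cache()              // placeholder path

// Too large to cache deserialized: store serialized objects instead,
// ideally with a fast serializer such as Kryo (set spark.serializer in the SparkConf).
val big = sc.textFile("hdfs:///tmp/events.log").persist(StorageLevel.MEMORY_ONLY_SER)

// Spill to disk only when recomputation is expensive; replicate only for fast fault recovery.
val parsed = big.map(line => line.split("\t")).persist(StorageLevel.MEMORY_AND_DISK_2)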

Spark uses Memory for 

RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of memory used when caching to a certain fraction of the JVM’s overall heap, set by spark.storage.memoryFraction

Shuffle and aggregation buffers: When performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations in addition to buffering data that is going to be directly output as part of the shuffle.

User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to everything “left” in the JVM heap after the space for RDD storage and shuffle storage is allocated.
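A sketch of the legacy (pre-1.6) configuration knobs referenced above; Spark 1.6+ replaces these with the unified memory manager (spark.memory.fraction):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")                 // hypothetical app name
  .set("spark.storage.memoryFraction", "0.6")         // fraction of heap for cached RDDs
  .set("spark.shuffle.memoryFraction", "0.2")         // fraction of heap for shuffle/aggregation buffers
// Whatever is left of the executor heap is available to user code.
val sc = new SparkContext(conf)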


Serialization is used when:




Jobs ----> Stages ----> Tasks




Schedulers




Lineage

“One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.”

“The most interesting question in designing this interface is how to represent dependencies between RDDs.”

“We found it both sufficient and useful to classify dependencies into two types:
narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD, and
wide dependencies, where multiple child partitions may depend on it.”



Stages





Dependencies: Narrow vs Wide

“This distinction is useful for two reasons:

1) Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.

2) Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.”
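A quick way to see the two dependency types from a Spark shell (assuming an existing SparkContext sc; the path is a placeholder):

val words  = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))     // narrow: each child partition uses one parent partition
val counts = pairs.reduceByKey(_ + _)         // wide: child partitions read from many parent partitions

// Inspect the dependency type of each RDD on its parent.
println(pairs.dependencies.head.getClass.getSimpleName)    // OneToOneDependency (narrow)
println(counts.dependencies.head.getClass.getSimpleName)   // ShuffleDependency  (wide)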



How do you know if a shuffle will be called on a Transformation?

-   repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles
-   If you declare a numPartitions parameter, it’ll probably shuffle
-   If a transformation constructs a ShuffledRDD, it’ll probably shuffle
-   combineByKey calls a shuffle (so do other transformations like groupByKey, which actually end up calling combineByKey)
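For example (assuming an existing SparkContext sc; the path is a placeholder):

val logs   = sc.textFile("hdfs:///tmp/access.log")
val byUser = logs.map(line => (line.split(" ")(0), 1))

// A *ByKey transformation such as reduceByKey constructs a ShuffledRDD under the hood;
// passing an explicit numPartitions is another strong hint that a shuffle will happen.
val counts = byUser.reduceByKey(_ + _, numPartitions = 16)

// toDebugString prints the lineage; the ShuffledRDD in the output marks the stage boundary.
println(counts.toDebugString)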

Common Performance issue checklist 

1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of sorting and large keys in groupBys)
3. Minimize amount of data shuffled
4. Know the standard library



Spark supports 2 types of shared variables: 

• Broadcast variables – allow your program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. Like sending a large, read-only lookup table to all the nodes.

Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

For example, to give every node a copy of a large input dataset efficiently

Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
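A minimal Scala sketch (the small map stands in for a large lookup table; assumes an existing SparkContext sc and a placeholder path):

// Stands in for a large read-only lookup table loaded in the driver.
val countryByIp = Map("1.2.3.4" -> "DE", "5.6.7.8" -> "US")
val bcLookup = sc.broadcast(countryByIp)

val requests = sc.textFile("hdfs:///tmp/access.log")
// Tasks read the broadcast value; Spark ships it to each worker once, not with every task.
val countries = requests.map(line => bcLookup.value.getOrElse(line.split(" ")(0), "unknown"))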

• Accumulators – allow you to aggregate values from worker nodes back to the driver program. They can be used to count the # of errors seen in an RDD of lines spread across 100s of nodes. Only the driver can access the value of an accumulator; tasks cannot. For tasks, accumulators are write-only.


Accumulators are variables that can only be “added” to through an associative operation.
Used to implement counters and sums, efficiently in parallel.
Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types.
Only the driver program can read an accumulator’s value, not the tasks
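A minimal Scala sketch using the Spark 2.x accumulator API (older releases used sc.accumulator instead); assumes an existing SparkContext sc and a placeholder path:

val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.textFile("hdfs:///tmp/access.log").map { line =>
  val fields = line.split("\t")
  if (fields.length < 3) badRecords.add(1)   // tasks can only add to the accumulator
  fields
}

parsed.count()                               // run an action so the adds actually happen
println(badRecords.value)                    // only the driver can read the value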

Spark use cases